Fast environmental sound classification based on resource adaptive convolutional neural network

Recently, with the construction of smart cities, research on environmental sound classification (ESC) has attracted attention from academia and industry. The development of convolutional neural networks (CNNs) has raised the accuracy of ESC, but this improvement is often accompanied by deeper networks, which leads to rapid growth in parameters and floating-point operations (FLOPs). As a result, CNN models are difficult to port to embedded devices, and their classification speed is hard to accept. To reduce the hardware requirements of running a CNN and to speed up ESC, this paper proposes a resource adaptive convolutional neural network (RACNN). RACNN uses a novel resource adaptive convolutional (RAC) module, which generates the same number of feature maps as a conventional convolution at lower cost and extracts the time and frequency features of audio efficiently. An RAC block based on the RAC module is designed to build the lightweight RACNN model, and the RAC module can also be used to upgrade existing CNN models. Experiments on public datasets show that RACNN achieves higher performance than state-of-the-art methods with lower computational complexity.

it generates the same number of channels with lower storage and operation cost and gets more abundant feature information. On its basis, the channel domain attention mechanism and skip connection are fused to generate an efficient feature extraction block and RACNN is formed by simply stacking this block. The specific process of ESC using RACNN is shown in Fig. 1.
In summary, the main contributions of this paper are as follows: (a) We propose an efficient RAC module, which reduces the number of parameters and FLOPs of the traditional convolution operation and can trade off accuracy against efficiency according to the actual situation. (b) Combining the shortcut and the channel domain attention mechanism, we build an efficient feature extraction block, the RAC block, and build RACNN by stacking this block. (c) Experiments on public datasets show that RACNN achieves a trade-off between accuracy and efficiency.
The rest of this paper is organized as follows. The "Related work" section introduces the related work. The "Method" section introduces our RACNN and the data preprocessing method. In the "Experiment" section, the proposed method is verified by experiments. Finally, a summary of the whole paper is given.

Related work
Deep learning has been widely used in various fields. In recent years, many scholars have introduced this technology into the field of ESC. In this section, we introduce deep learning methods applied in ESC-related fields and mainstream research on CNN compression.
The earliest and most commonly used CNN model in the field of ESC is the 2-D CNN. Piczak 4 first proposed using a 2-D CNN to learn Log-Mel spectrogram features, which significantly improved ESC performance compared with traditional machine learning algorithms such as KNN and SVM. Chen et al. 5 accurately identified vehicle audio signals by fusing LSTM units into a convolutional neural network. Boddapati et al. 6 used AlexNet 7 and GoogLeNet 8 to classify environmental sound features extracted from the spectrum. However, these CNNs were originally designed to classify the large image dataset ImageNet. They are therefore not fully suited to the ESC task: they overfit easily, cannot give full play to CNN performance, carry redundant parameters and run slowly. Subsequently, many scholars began to study the influence of different spectrogram features on the final classification results. Tran et al. 9 proposed SirenNet, which combines raw audio waveforms, MFCC and Log-Mel as input for siren-based emergency vehicle detection. Later, Su et al. 10 combined features to form the TSCNN-DS model, which achieved 97.2% classification accuracy on the UrbanSound8K dataset. Su et al. 11 further analyzed the performance of ESC based on multi-aggregation acoustic features. Through a large number of experiments, the authors found the best feature aggregation strategy among feature combinations including MFCC, Log-Mel, Chroma, Spectral Contrast and Tonnetz. By fusing MFCC, Log-Mel, Spectral Contrast and Tonnetz, they reached accuracies of 85.6% on ESC-50 and 93.4% on UrbanSound8K. In addition to the most commonly used 2-D CNN, many scholars approach ESC tasks with 1-D CNNs. Zhang et al. 12 proposed an ESC method based on VGGNet 13 and set the convolution filters to 1-D to learn the frequency and time characteristics of audio. Dai et al. 14 proposed a 34-layer 1-D CNN model to classify raw one-dimensional waveform data; it showed accuracy competitive with a 2-D CNN based on the Log-Mel spectrogram, but it needs deeper convolution layers. Abdoli et al. 15 proposed an end-to-end ESC method based on a 1-D CNN, without hand-crafted feature extraction. Antonio et al. 16 proposed DENet, which uses lossless raw audio as input and combines the proposed layer with a bidirectional gated recurrent unit to obtain good audio classification results. Francisco et al. 17 developed the SinNet neural network architecture, which uses raw audio to classify animal sounds and converges rapidly when data are limited. Dong et al. 18 proposed a Two-Stream convolutional neural network composed of a 1-D CNN based on raw audio and a 2-D CNN based on the Log-Mel spectrogram. It combines the time and frequency characteristics of audio and achieves 95.7% average accuracy and 96.07% highest accuracy on UrbanSound8K.
To make ESC-related research better serve practical applications, researchers have built on it to study the task of sound event localization and detection (SELD). Shimada et al. 19 proposed a CRNN framework that combines CNN and RNN to localize and detect sound events, but its performance leaves room for improvement. Nguyen et al. 20 replaced the backbone network in CRNN with VGG and ResNet and proposed a new SALSA feature, which achieved excellent performance. The authors also tested combinations of the backbone network with different RNN structures. Sun et al. 21 proposed Adaptive Hybrid Convolution based on the idea of matrix decomposition and combined it with an attention module to obtain good results in the SELD task. Sudarsanam et al. 22 replaced the RNN blocks in the CRNN architecture with self-attention blocks. They also investigated stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and adding position embedding and layer normalization. With the rise of Transformer research, this structure has also been applied to SELD. Huang et al. 23 obtained performance no worse than CRNN using a combination of CNN and Transformer.
The CNN-based works discussed above cover many ESC-related fields, but most of them ignore one of the key issues in ESC tasks, namely real-time performance. Yousef et al. 24 proposed a simple shallow model with a single MFCC feature for ESC, but in essence it still simply stacks convolutional layers. Although a shallow CNN model improves real-time classification to a certain extent, its lower model capacity makes it difficult to improve classification accuracy.
To improve the operating efficiency of CNN models, many studies on CNN compression have been proposed. Li et al. 25 proposed a neural network pruning method that uses the L1 norm of the elements in each filter as the saliency measure and removes filters with small values to obtain a "thinner" network. This reduces the running cost of the model, and the loss of accuracy is recovered through fine-tuning. Further, Valerio et al. 26 proposed a dynamic hard pruning method that progressively prunes low-contribution neurons during training. It reduces both the size of the final model and the memory footprint during training, and the accuracy loss caused by pruning is offset by a dynamic batch sizing method. Hinton et al. 27 proposed the idea of knowledge distillation, in which soft targets produced by the teacher network guide the training of the student network, thereby realizing knowledge transfer. However, this method ignores the important structural knowledge of the teacher network. Later, Tian et al. 28 introduced contrastive learning into knowledge distillation to train the student to capture significantly more information from the teacher's representation of the data. Chen et al. 29 proposed HashNet, which uses a hash function to group weights so that weights in the same hash bucket share the same value, thereby significantly reducing the model size. Dettmers 30 used an 8-bit approximate data type instead of the 32-bit floating-point representation to improve the running speed of the model, and designed a dynamic tree data type to reduce approximation errors. For the purpose of extreme acceleration, binarized networks have also been developed. Courbariaux et al. 31 proposed BinaryConnect, which uses binarized weights during forward and backward propagation to train DNNs but still keeps full-precision weights for the gradient updates. Zhou et al. 32 proposed Incremental Network Quantization (INQ), which transforms a full-precision network model into a lossless low-precision version through iterative weight partition, group-wise quantization and retraining, and can be accelerated by hardware shifting.
Most of the above CNN compression methods are applied to existing classical models, so their performance is affected by the baseline models. In addition, these models are mostly used in the field of computer vision. To better serve the ESC task, we propose a lightweight model, RACNN, which reduces the memory footprint of the training and inference processes and maximizes model performance within limited storage and computing resources.

Method
In this section, we introduce the proposed ESC method. First, we introduce the proposed RACNN model, and then we describe the preprocessing process of environmental sound data.
Proposed RACNN Model. Deep convolutional neural networks such as AlexNet 7 , VGG 13 and ResNet 33 usually improve their accuracy by stacking a large number of convolution operations, which consumes a large amount of storage and computing resources. However, we find that the feature maps output by the hidden middle layers are often highly similar to one another. The feature maps in Fig. 2 are obtained from the first layer of VGG-11 on the UrbanSound8K dataset. Feature maps marked with blue and black borders have strong similarity, which means that there is a lot of redundancy in the convolution operations of a CNN model. However, if the redundancy of the middle feature maps is reduced by simply scaling down the number of convolution channels, the accuracy drops. Therefore, maintaining a certain number of redundant feature maps plays a positive role in the final classification results.
The feature maps output by the middle layers of current mainstream CNN models are highly similar and redundant, and this redundancy plays a positive role in the final classification results. We therefore focus on reducing the resources required to generate these similar feature maps, that is, on finding a cheap replacement for the filters that generate them. For an intermediate convolution layer, let the input data be $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input. The $\ell$-th convolution operation can be expressed as:

$$Y = h\big(\gamma_\ell \cdot \mathrm{norm}(X * \theta_\ell + b_\ell) + \beta_\ell\big) \quad (1)$$

where $*$ represents the convolution operation, $\theta_\ell \in \mathbb{R}^{c_{\ell-1} \times c_\ell \times k_h \times k_w}$ is the weight tensor, $k_h$ and $k_w$ are the height and width of the filter, $b_\ell \in \mathbb{R}^{c_\ell}$ is the bias term, $\mathrm{norm}(x)$ is the batch normalization (BN) operation 34 , $\gamma_\ell, \beta_\ell \in \mathbb{R}^{c_\ell}$ are the scale factor and the offset factor respectively, and $h(x)$ is the activation function. For an output of spatial size $h \times w$, the number of parameters and the FLOPs of this convolution are:

$$\mathrm{Params} = c_{\ell-1} \cdot c_\ell \cdot k_h \cdot k_w \quad (2)$$

$$\mathrm{FLOPs} = h \cdot w \cdot c_{\ell-1} \cdot c_\ell \cdot k_h \cdot k_w \quad (3)$$

At present, CNNs generally use convolutions with high resolution and a large number of channels. The number of channels is often 256 or 512, or even in the thousands, so the parameters and FLOPs of the convolutions are huge.
Solution. In view of the above analysis, this paper proposes the RAC module. As shown in Fig. 2, the feature maps output by a convolutional layer are very similar to each other. We believe that these similar feature maps do not need to be obtained through expensive calculations. They are like replicas of inherent feature maps: they bring limited performance improvement to the model but consume a large number of parameters and FLOPs. Therefore, we can generate these redundant feature maps through a series of cheap operations on the inherent feature maps. As shown in Fig. 3, we first generate the inherent feature maps $Y' \in \mathbb{R}^{h \times w \times c'_\ell}$ through the conventional convolution operation:

$$Y' = X * \theta'_\ell + b'_\ell$$

where $\theta'_\ell \in \mathbb{R}^{c_{\ell-1} \times c'_\ell \times k_h \times k_w}$ and $b'_\ell \in \mathbb{R}^{c'_\ell}$ are the convolution parameters used to generate the inherent feature maps, and $c'_\ell < c_\ell$. We then take the inherent feature maps as input, generate the remaining $c_\ell - c'_\ell$ feature maps by pointwise (1 × 1) convolution, and merge them with the inherent feature maps. The result is passed as input to the next layer:

$$Y = Y' \oplus (Y' * \theta''_\ell)$$

where $\theta''_\ell \in \mathbb{R}^{c'_\ell \times (c_\ell - c'_\ell) \times 1 \times 1}$ is the pointwise convolution weight and $\oplus$ indicates concatenation along the channel dimension. Compared with the commonly used 3 × 3, 5 × 5 and 7 × 7 convolution operations, the pointwise convolution is almost negligible in parameters and FLOPs. Moreover, each channel of the feature maps obtained by pointwise convolution combines the information of all channels of the inherent feature maps, which makes the feature information it contains more abundant. We can change the compression ratio by adjusting the ratio between the inherent feature maps and the feature maps generated by the cheap method.
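The two-step structure described above (an ordinary convolution for the inherent maps, then a pointwise convolution on those maps, then channel concatenation) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the helper names `conv2d`, `random_weight` and `rac_module` are hypothetical, bias, BN and activation are omitted, "same" padding with stride 1 is assumed, and `alpha` is read here as the share of output channels produced by the pointwise branch.

```python
import random


def conv2d(x, weight):
    """Naive stride-1 convolution with zero 'same' padding.

    x:      nested lists shaped [c_in][h][w]
    weight: nested lists shaped [c_out][c_in][k][k], k odd
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for filt in weight:
        fmap = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:
                                acc += x[c][ii][jj] * filt[c][di][dj]
                fmap[i][j] = acc
        out.append(fmap)
    return out


def random_weight(c_out, c_in, k, rng):
    """Random filter bank shaped [c_out][c_in][k][k] (illustration only)."""
    return [[[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def rac_module(x, w_inherent, w_pointwise):
    """Inherent maps from a k x k conv, extra maps from a 1 x 1 conv on them."""
    inherent = conv2d(x, w_inherent)       # expensive spatial convolution
    cheap = conv2d(inherent, w_pointwise)  # cheap pointwise convolution
    return inherent + cheap                # list concatenation = channel concat


rng = random.Random(0)
c_in, c_out, alpha = 3, 8, 0.5
c_cheap = int(alpha * c_out)       # channels produced by the pointwise branch
c_inherent = c_out - c_cheap       # channels produced by the 3 x 3 convolution
x = [[[rng.uniform(-1, 1) for _ in range(6)] for _ in range(5)]
     for _ in range(c_in)]
w3 = random_weight(c_inherent, c_in, 3, rng)
w1 = random_weight(c_cheap, c_inherent, 1, rng)
y = rac_module(x, w3, w1)
print(len(y), len(y[0]), len(y[0][0]))  # channels, height, width
```

The output keeps the spatial size of the input and has `c_out` channels in total, of which only `c_inherent` required a spatial convolution.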
Complexity analysis. The RAC module can generate the same number of feature maps as a conventional convolutional layer with less resource consumption. Therefore, we can easily use the RAC module to upgrade existing classical neural network architectures and reduce their computational cost. Next, we analyze in detail the effectiveness of the RAC module in reducing the number of parameters and FLOPs. We use the RAC module to replace the $\ell$-th conventional convolution operation. Let $\alpha$ denote the proportion of the $c_\ell$ output feature maps that is generated by pointwise convolution, so that $c'_\ell = (1 - \alpha) c_\ell$. The parameter compression ratio of the RAC module compared with ordinary convolution is then:

$$\mathrm{ratio}_P = \frac{c_{\ell-1} c_\ell k_h k_w}{c_{\ell-1} c'_\ell k_h k_w + c'_\ell (c_\ell - c'_\ell)} = \frac{c_{\ell-1} k_h k_w}{(1 - \alpha)(c_{\ell-1} k_h k_w + \alpha c_\ell)} \approx \frac{1}{1 - \alpha} \quad (7)$$

where the approximation holds when $\alpha c_\ell \ll c_{\ell-1} k_h k_w$. Similarly, the acceleration ratio for FLOPs follows from Eq. (3) as $\mathrm{ratio}_F \approx 1/(1 - \alpha)$. A trade-off between computational complexity and accuracy can be achieved by adjusting $\alpha$. When $\alpha = 1$, however, the RAC module degenerates into a regular convolution with a kernel size of 1 × 1, which degrades performance because it cannot capture the spatial relationship of feature information. Therefore, $\alpha$ should be chosen reasonably according to the actual task.
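As a rough numerical check of the complexity analysis, the following sketch counts weights and multiply-accumulate operations for a conventional convolution and for an RAC-style replacement, and compares the measured ratio with 1/(1 − α). The function names and the example layer size (64 input channels, 64 output channels, 3 × 3 kernel, 32 × 32 output) are illustrative assumptions, bias terms are ignored, and α is taken as the share of output channels produced by the pointwise convolution.

```python
def conv_cost(c_in, c_out, k_h, k_w, h_out, w_out):
    """Weight count and multiply-accumulate count of one convolution."""
    params = c_in * c_out * k_h * k_w
    flops = params * h_out * w_out
    return params, flops


def rac_cost(c_in, c_out, k_h, k_w, h_out, w_out, alpha):
    """Cost of a k x k conv for the inherent maps plus a 1 x 1 conv on them."""
    c_cheap = round(alpha * c_out)      # maps produced by pointwise conv
    c_inherent = c_out - c_cheap        # maps produced by the k x k conv
    p_main, f_main = conv_cost(c_in, c_inherent, k_h, k_w, h_out, w_out)
    p_pw, f_pw = conv_cost(c_inherent, c_cheap, 1, 1, h_out, w_out)
    return p_main + p_pw, f_main + f_pw


base_params, base_flops = conv_cost(64, 64, 3, 3, 32, 32)
for alpha in (0.25, 0.5, 0.75):
    params, flops = rac_cost(64, 64, 3, 3, 32, 32, alpha)
    print(alpha,
          round(base_params / params, 3),   # measured parameter ratio
          round(base_flops / flops, 3),     # measured FLOPs ratio
          round(1 / (1 - alpha), 3))        # approximation 1 / (1 - alpha)
```

The measured ratios stay slightly below 1/(1 − α) because the pointwise branch is cheap but not free.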
Efficient network construction. Using the efficient RAC module and drawing on the idea of the residual module in ResNet, we designed the RAC block. As shown in Fig. 4, the RAC block integrates the RAC module, the shortcut and the channel domain attention mechanism. The main part of the proposed RAC block is composed of two stacked RAC modules. After the first RAC module, we add BN 34 and ReLU 35 nonlinear activation layers. Following ResNet, only a BN layer is added after the second RAC module, and the ReLU nonlinear activation layer is added after the shortcut operation. The number of channels output by the two RAC modules can be adjusted according to specific needs. The RAC block mainly has the following three structures: (a) The two RAC modules have the same number of output channels. (b) The first RAC module has fewer output channels than the second, so the first RAC module plays a role of dimensionality reduction. Through this structure, a more compact neural network can be obtained. We call this structure the LRAC block. (c) More convolution channels are used in the first RAC module, which we call the HRAC block. The first two structures are mainly used in our experiments. Practitioners can choose the most suitable structure according to their actual needs. After the second RAC module, we selectively add the SE module 36 : by processing the obtained feature maps, a one-dimensional vector with one entry per channel is obtained as the score of each channel, and each score is then applied to its corresponding channel. Finally, a shortcut connection is established between the input and output of the block:

$$Y = F(X, \theta) + X$$

where $F(X, \theta)$ represents the serial calculation of the two RAC modules, and $\theta$ is the weight parameter of the calculation.
When the number of channels of the input data and output data of the block is not uniform, we perform dimensionality increase and dimensionality reduction operations through pointwise convolution to achieve shortcut connections. If the stride of the RAC block is 2, the pointwise convolution with stride of 2 is also used to complete the down-sampling operation.
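The channel scoring and shortcut steps of the RAC block can be sketched as follows. The sketch assumes the standard squeeze-and-excitation layout (global average pooling, a bottleneck fully connected layer with ReLU, a second fully connected layer with sigmoid) and omits biases; the function names and the toy weights are hypothetical and are not taken from the paper.

```python
import math


def squeeze_excite(fmaps, w1, w2):
    """Rescale each channel by a learned score in (0, 1).

    fmaps: [c][h][w], w1: [c_mid][c], w2: [c][c_mid]
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in fmaps]
    # Excitation: bottleneck layer with ReLU, then sigmoid scores.
    hidden = [max(0.0, sum(wi * zi for wi, zi in zip(row, z))) for row in w1]
    scores = [1.0 / (1.0 + math.exp(-sum(wi * hi for wi, hi in zip(row, hidden))))
              for row in w2]
    # Apply each score to its corresponding channel.
    scaled = [[[v * s for v in row] for row in fm]
              for fm, s in zip(fmaps, scores)]
    return scaled, scores


def shortcut_add_relu(branch, identity):
    """Element-wise sum of the branch output and the shortcut, then ReLU."""
    return [[[max(0.0, b + s) for b, s in zip(rb, rs)]
             for rb, rs in zip(fb, fs)]
            for fb, fs in zip(branch, identity)]


x = [[[1.0, -2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]        # 2 channels -> 1 hidden unit
w2 = [[0.0], [0.0]]      # zero weights give a neutral score of 0.5
scaled, scores = squeeze_excite(x, w1, w2)
out = shortcut_add_relu(scaled, x)
print(scores, out[0])
```

When the input and output channel counts differ, the identity branch would first pass through the pointwise convolution mentioned above; that projection is left out here to keep the sketch short.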
On the basis of the RAC block, RACNN is formed by simple stacking. As shown in Table 1, we follow ResNet's basic architecture. For the input samples, we first perform a 3 × 3 convolution to extract features and increase the feature dimension. This is followed by RAC blocks with 16, 32, 64 and 128 output channels in turn, with down-sampling by a factor of 2 as the number of channels increases. Next comes the global average pooling (GAP) layer, which turns the feature maps into a one-dimensional vector, and finally a dense layer with the softmax function completes the classification. Dropout is also applied to some layers of RACNN. The specific forward propagation process is shown in Fig. 5. The proposed architecture only provides a basic reference; further tuning of hyperparameters or exploration of the architecture may further improve the performance of RACNN.
For different real-world scenarios, we can use a smaller model to achieve faster classification or a larger model to achieve higher classification accuracy on specific tasks. We simply multiply the number of output channels of each layer uniformly by a coefficient μ and change the width of the neural network through this coefficient. By adjusting the width coefficient μ, we can easily trade off latency against performance.
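A minimal sketch of the width coefficient μ, applied to the stage widths 16, 32, 64 and 128 mentioned above. The rounding rule and the floor of one channel are assumptions made for illustration; the paper does not state how fractional channel counts are handled.

```python
BASE_CHANNELS = [16, 32, 64, 128]  # RAC block output channels from the text


def scale_channels(base_channels, mu, minimum=1):
    """Multiply every stage width by mu, rounding to a whole channel count."""
    return [max(minimum, round(c * mu)) for c in base_channels]


print(scale_channels(BASE_CHANNELS, 0.5))
print(scale_channels(BASE_CHANNELS, 2.0))
```

With μ = 0.5 the sketch halves every stage width, and with μ = 2 it doubles them, which mirrors the narrower and wider RACNN variants used later in the experiments.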
Different from existing methods. (a) Different from the widely used depthwise convolution, the RAC module can fuse the feature information of multiple channels, fully learn the spatial information of the feature maps, and improve classification performance. (b) Different from the Inception series of models, which also use different kernel sizes within one module, the RAC module uses a serial structure, whereas the Inception module uses a parallel structure. The parallel structure has the following disadvantages. First, the convolutions with different kernel sizes in the parallel structure accept all the feature channels, while the pointwise convolution of the RAC module accepts only some channels. Therefore, our method needs fewer parameters and FLOPs than the Inception structure. Second, according to the research in 37 , the operating efficiency of a serial structure is higher than that of a parallel structure, so the RAC module has lower latency. (c) We show through experiments that, in terms of accuracy, the serial structure of the RAC module is also better than the parallel structure used by the Inception module.
Data preprocessing. Feature extraction. The experiments involve four datasets: UrbanSound8K 38 , ESC-10, ESC-50 39 and the TAU-NIGENS Spatial Sound Events 2021 development dataset 40 . Unlike speech, an environmental sound event (ESE) usually contains more noise, so it is more difficult to recognize. The mel filterbank closely imitates the response of the human auditory system. Because the human ear's perception of sound is not linear, it is better described by a logarithmic relationship. Log-Mel features are therefore often used to process voice data, and we rely on Log-Mel features for neural network training.
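To make the non-linear frequency scale behind Log-Mel features concrete, the sketch below uses the widely used HTK-style mel formula and a decibel-style log compression. The paper does not state which mel variant, number of bands or frequency range it uses, so the formula choice, the function names and the example values (8 bands, 0 to 8000 Hz) are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale: roughly linear at low and logarithmic at high frequency."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def log_compress(energy, eps=1e-10):
    """Decibel-style log compression of a band energy."""
    return 10.0 * math.log10(max(energy, eps))


centers = mel_band_centers(8, 0.0, 8000.0)
print([round(c) for c in centers])
```

The band centers are dense at low frequencies and sparse at high frequencies, which is the property that the text attributes to human hearing.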
Zero-padding. As a public dataset, UrbanSound8K is often used in ESC-related research. This dataset contains 10 categories and a total of 8732 samples (≤ 4 s). Among them, 1798 samples are shorter than 4 s, as shown in Fig. 6. However, neural networks usually require fixed-size inputs. Discarding these samples would waste a large part of the dataset. Moreover, for categories such as gun shots, most of the samples are shorter than 4 s. If only samples with a duration of exactly 4 s are used for training, it is very easy to cause over-fitting and reduce model performance. In addition, the length of samples collected in real scenarios is usually difficult to unify. To this end, we repair the data with zero padding: data samples shorter than 4 s are directly extended with zeros. Although this method is very simple, it has shown good performance in the experiments. It retains the roughly 20% of the data that would otherwise be discarded while keeping the duration of the data samples consistent, as shown in Fig. 7.
Data augmentation. The ESC-50 dataset has a small number of data samples (2000 in total, 40 in each category), so it is easy to cause over-fitting. We performed data augmentation on the audio data to enhance the generalization ability of the model. We mainly performed the following operations: (a) Pitch shift augmentation. By scaling the frequency to adjust the pitch, we raise and lower the audio signal to a certain extent. Here we set the shift factor to +2/−2. (b) Time shift augmentation. The scale changes in the time dimension, and the audio data is stretched or accelerated. In this paper, we stretch the sound clip to 1.2 times its original length and then cut it back to its original length.
In summary, the specific ESC framework proposed in this paper is shown in Algorithm 1.
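The zero-padding step can be sketched in a few lines. The function name, the plain-list representation of the waveform and the choice to append the zeros at the end of the clip are assumptions; the text only states that clips shorter than 4 s are filled with zeros.

```python
def pad_or_crop(samples, sample_rate, duration_s=4.0):
    """Return a clip of exactly duration_s seconds.

    Shorter clips are extended with trailing zeros; longer clips are truncated.
    """
    target = int(round(sample_rate * duration_s))
    if len(samples) >= target:
        return list(samples[:target])
    return list(samples) + [0.0] * (target - len(samples))


# Toy example with a 2 Hz "sample rate" so the lists stay readable.
short_clip = [1.0, 2.0]
long_clip = [float(i) for i in range(10)]
print(pad_or_crop(short_clip, 2, 2.0))
print(pad_or_crop(long_clip, 2, 2.0))
```

The same helper also covers the cropping half of the time-stretch augmentation, where a clip stretched to 1.2 times its length is cut back to the original length.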

Algorithm 1
The pseudo-code for the ESC framework using RACNN.

Hyperparameter settings. We use a stochastic gradient descent (SGD) optimizer with a multi-step learning rate strategy to train the proposed model. We use Nesterov momentum with a momentum weight of 0.9 and no damping, and a weight decay of 5 × 10⁻⁴. The batch size is set to 32. For the ESC datasets, the initial learning rate is set to 0.1. The model on UrbanSound8K is trained for 120 epochs, the learning rate is multiplied by the attenuation coefficient 0.1 every 40 epochs, and the final result is obtained using tenfold cross-validation. The models on ESC-10 and ESC-50 are trained for 300 epochs, the learning rate is multiplied by the attenuation coefficient 0.1 every 100 epochs, and the final result is obtained using fivefold cross-validation. For the SELD dataset, an initial learning rate of 0.001 is used and multiplied by a decay factor of 0.1 every 100 epochs until the loss on the validation set no longer decreases. We take the model that performs best on the validation set and report its performance on the test set. Finally, we report "mean ± variance".
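The multi-step learning rate strategy described in the hyperparameter settings can be written as a small function. The function name and signature are hypothetical; the default values correspond to the UrbanSound8K setting (initial rate 0.1, multiplied by 0.1 every 40 epochs).

```python
def multistep_lr(epoch, base_lr=0.1, step=40, gamma=0.1):
    """Learning rate for a zero-based epoch: base_lr * gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)


# UrbanSound8K: 120 epochs, decay every 40 epochs.
for epoch in (0, 39, 40, 80, 119):
    print(epoch, multistep_lr(epoch))

# ESC-10 / ESC-50: 300 epochs, decay every 100 epochs.
for epoch in (0, 100, 299):
    print(epoch, multistep_lr(epoch, step=100))
```

Because of floating-point rounding, the decayed values print as close approximations of 0.01 and 0.001 rather than exact decimals.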

Comparison with the parallel structure. The research of Ma et al. 37 has shown that a parallel structure is not conducive to computing efficiency. We further test the performance of the two structures on UrbanSound8K. The parallel structure of the RAC module is shown in Fig. 8. We test the performance of the two structures under different compression ratios by adjusting the ratio α of pointwise convolution to the original convolution. The details are shown in Fig. 9, where the data represent the accuracy and "Baseline" represents the RACNN model using the traditional convolution operation. Under different ratios α, the accuracy of the serial RAC module is almost always better than that of the parallel RAC module. Therefore, it can be concluded that the proposed RAC module not only has higher computing efficiency but also achieves good accuracy.
Results on the UrbanSound8K dataset. First, we test the accuracy of RACNN under different ratios α. As shown in Fig. 10, when α ≤ 0.5, the accuracy of the model is relatively stable as the parameters and FLOPs decrease. When α > 0.5, the accuracy of the model declines rapidly. When α = 1, the accuracy is reduced to 85.07% (± 0.64%) (not shown in Fig. 10), and the accuracy fluctuates greatly. Therefore, keeping a certain number of 3 × 3 convolutions is beneficial to the final result. Based on comprehensive considerations, we select the model obtained with α = 0.5 as our final model for classifying the UrbanSound8K dataset. When α is set to 0.5, not only is the overall classification result high (97.51% (± 0.18%)), but the classification result of each class is also outstanding. The confusion matrix is shown in Fig. 11. Except for children playing and street music, the recognition accuracy of all sound types exceeds 95%, and the classification accuracy of engine idling, gunshot and siren even reaches 100%. The parameters and FLOPs of the model under different values of α are reported in Table 2. By adjusting the value of α, accuracy and efficiency can be flexibly balanced.

Results on the ESC-10 dataset.
For the ESC-10 dataset, since the dataset is simpler than the UrbanSound8K dataset, we simplified the RACNN used on the UrbanSound8K dataset. As shown in Table 3, we multiply the number of output channels of each RAC block by 0.5, while the first convolutional layer of RACNN is left unchanged. We found that reducing the number of output channels of the first convolutional layer seriously reduces the accuracy of the model. Therefore, a certain number of convolution channels is necessary to fully extract the features of the input data; otherwise feature information is lost and the performance of the model suffers. We also conduct experiments with different values of α, and the resource consumption under different α is reported in Table 3. As shown in Fig. 12, when α ≤ 0.5, the performance of the model does not fluctuate significantly as the model consumes fewer resources. The performance is best when α = 0.4. When α = 1, similar to RACNN on UrbanSound8K, there is a significant decrease in accuracy (67.50% (± 1.76%)) and large fluctuations. For the selected final model, we report not only the overall accuracy (94.75% (± 0.93%)) but also the accuracy in the different categories. The confusion matrix is shown in Fig. 13. RACNN reaches 100% accuracy on the seven sounds of dog, rooster, rain, crying baby, sneezing, clock tick and helicopter. The accuracy on sea waves, crackling fire and chainsaw is also acceptable (88%).

Results on the ESC-50 dataset.
To bring the experiments closer to real application scenarios and verify the performance of the RACNN model there, we conduct experiments on the ESC-50 dataset. The ESC-50 dataset is more complex than the above two datasets, with more classification categories and less training data, so it is very easy to overfit during training. For this reason, the width of RACNN is doubled as a whole, and the larger model capacity gives it stronger feature processing capabilities. At the same time, we expand the input feature matrix to 128 × 128. Although this increases FLOPs to a certain extent, the richer feature information significantly improves the classification results. During the experiments, we found that the SE module did not increase the performance of RACNN on ESC-50. For this reason, we removed the SE module from the RAC block in the experiments on ESC-50.
To find a suitable α that balances accuracy and efficiency, we compared the accuracy of RACNN models under different α; the resource consumption under different α is reported in Table 4. As shown in Fig. 14, when α = 0.6, the model obtains the best classification accuracy (86.65% (± 0.25%)). When α = 1, the model cannot converge on the test set because it cannot capture the spatial connection of the time-frequency feature information. In addition, when α = 0.2, 0.3 and 0.4, the performance also decreases. We attribute this to the small amount of training data and the fine classification granularity, so that over-parameterization of the model leads to over-fitting. We also tested the performance of the model on the different categories. Because the ESC-50 dataset contains a large number of categories, we do not show the confusion matrix but instead show the accuracy of the 50 categories as a histogram in Fig. 15.
… 10 have also obtained good accuracy, but these models all use the feature fusion method, while RACNN only uses a single Log-Mel spectrogram feature. Despite this, the RACNN model still maintains an advantage in accuracy. Practitioners can choose appropriate sound features to input into RACNN. Since this is not the focus of our work, we do not discuss it in detail. In addition to accuracy, our other important evaluation indices are the number of parameters and FLOPs, which are also the focus of our work. Most works do not report their FLOPs, and FLOPs are affected by the size of the input data. Therefore, …
ESC-50. This dataset is extremely challenging. Its larger number of categories, finer-grained categories and limited trainable data make it difficult for neural network models to fit its data features and achieve high-precision classification. As shown in Table 6, the RACNN model reaches an accuracy that competes with state-of-the-art methods (the average accuracy is 85.65% and the highest accuracy is 86%), and the number of parameters of the RACNN model is still at a minimum level. Because we only use the Log-Mel spectrogram as the input of the model, the FLOPs of the RACNN model are at a low level.

Sound event localization and detection.
To verify the generalization of RACNN, we apply RACNN to the task of sound event localization and detection (SELD). SELD is composed of two subtasks, sound event detection (SED) and direction-of-arrival estimation (DOAE), so it is more challenging. CRNN 19 has been the mainstream method in the SELD field since it was proposed. Therefore, we combine RACNN with this framework to localize and detect sound events. CRNN is mainly composed of three parts: backbone convolutional layers, recurrent layers and transcription layers. The backbone network extracts a feature sequence from the audio data and sends it to the bidirectional gated recurrent unit (BiGRU) of the recurrent layer for context information learning. The output of the BiGRU is then fed to the two parallel branches of the fully connected block of the transcription layer to complete sound event localization and detection. We verify the performance of RACNN on SELD by replacing the backbone convolutional network in CRNN with RACNN and training the model using the ACCDOA 49 output format. The specific RACNN structure is shown in Table 7.
We compare RACNN with other lightweight models and model compression methods. In Table 7, the method of Sun et al. 21 uses the idea of matrix decomposition to build a lightweight model for SELD, and the method of Li et al. 25 uses model pruning to compress ResNet14. The convolutional channels of MobileNet-V1 and MobileNet-V2 have also been adapted to this task. To evaluate SELD performance, the official evaluation metrics 52 from the DCASE2021 challenge are used in the experiments. As shown in Table 8, RACNN still achieves better performance with a similar or even lower number of parameters and floating-point operations. In addition, to further verify the effectiveness of the method, we obtained the Uniform model by directly scaling the number of channels of RACNN, but the Uniform model did not achieve the desired effect. This shows that maintaining a certain number of feature maps is beneficial to the model, so obtaining a lightweight model by reducing the generation cost of feature maps is a correct direction.

Conclusion
In this paper, we propose a lightweight resource adaptive convolutional neural network (RACNN). After observing the feature maps output by the hidden middle layers, we found that many feature maps are similar to one another. We therefore seek to obtain these redundant feature maps with lower resource consumption. Based on this, we propose the RAC module. It obtains the same number of feature maps as the traditional convolution operation with less resource consumption, and its resource consumption can be adjusted according to actual needs. Although the RAC module can simply upgrade an existing CNN, in order to better extract abstract features for classification, we propose an efficient feature extraction block based on the RAC module, the RAC block, and build RACNN by simply stacking RAC blocks. We first conduct experiments on the UrbanSound8K, ESC-10 and ESC-50 datasets. Compared with state-of-the-art models, the RACNN model not only maintains a leading position in accuracy, but its number of parameters and FLOPs are also much lower than those of these models. This makes the proposed RACNN model easier to port to embedded devices that lack storage and computing resources, and gives it better real-time processing capability. We also use RACNN for the SELD task, demonstrating its excellent generalization performance. In this work, we only use a single feature, the Log-Mel spectrogram. In future work, we will evaluate the performance of different features and mixed features on RACNN, so as to give full play to the performance of RACNN and improve the generality of the model. In addition, we will also consider fusing RACNN with current mainstream CNN compression methods to further reduce its running cost and improve inference speed.